Data analysis

Possible analyses:

  • Which users tweet the most on each day
  • How many tweets per user there are on average per day
  • Plot the users on a map with folium (already done in the other analysis)
  • Semantics: extract the hashtags and analyze them, e.g. which are the most frequent
    In [9]:
    import numpy as np
    import pandas as pd
    %matplotlib inline
    import math
    import scipy.stats as stats
    import matplotlib.mlab as mlab
    import matplotlib.pyplot as plt
    import matplotlib.patches as mpatches
    import seaborn as sns 
    from datetime import datetime
    from pandas import Timestamp
    import collections
    import os
    
    dirname = '../csv'
    
    #for csv in os.listdir(dirname):
        # read the csv
        #df.drop(['Unnamed: 0'], axis='columns', inplace=True)
        #df['Created_At'] = pd.to_datetime(df['Created_At'])
    
    df = pd.read_csv('../csv/df_uidFiles_2016.csv')
    df.drop(['Unnamed: 0'], axis='columns', inplace=True)
    df['Created_At'] = pd.to_datetime(df['Created_At'])
    df.head()
    
    Out[9]:
    Screen_name UserID TweetID Coords Lat Lon Created_At Text
    0 madikeeper12 868809325 779072240994234368 [43.72666207, 10.41268069] 43.726662 10.412681 2016-09-22 21:37:51+00:00 Cieli infuocati.\n\n#picoftheday #quotesofthed...
    1 madikeeper12 868809325 781615843406819329 [43.72666207, 10.41268069] 43.726662 10.412681 2016-09-29 22:05:13+00:00 Prospettive.. \nunite a casa #ilselfone\n#team...
    2 madikeeper12 868809325 781870800156499968 [43.72666207, 10.41268069] 43.726662 10.412681 2016-09-30 14:58:19+00:00 Non occorre essere matti per lavorare qui, ma ...
    3 madikeeper12 868809325 780003801260404736 [43.7167, 10.3833] 43.716700 10.383300 2016-09-25 11:19:32+00:00 RunOnSunDay 🏃🏽‍♀️☀️\n#run #running #runner #ni...
    4 madikeeper12 868809325 779443101123260417 [43.70561, 10.42059] 43.705610 10.420590 2016-09-23 22:11:31+00:00 La vita è come la fotografia sono necessari i ...
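
The Text column shown above carries the hashtags mentioned in the list of possible analyses. A minimal sketch of a hashtag frequency count follows. The sample texts and the helper name `count_hashtags` are hypothetical and only illustrate the idea; they are not taken from the dataset.

```python
import re
from collections import Counter

# Hypothetical sample texts standing in for df['Text']; values are made up.
texts = [
    "Sunset over the river #picoftheday #Pisa",
    "Morning run #run #running #pisa",
    "Leaning tower again #Pisa #picoftheday",
]

# '#' followed by word characters (letters, digits, underscore).
HASHTAG_RE = re.compile(r"#(\w+)")

def count_hashtags(texts):
    """Return a Counter of lower-cased hashtags across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts

hashtag_counts = count_hashtags(texts)
print(hashtag_counts.most_common(3))
```

On the real data the same helper could presumably be called as `count_hashtags(df['Text'])`. Lower-casing merges variants such as `#Pisa` and `#pisa`; whether that is desirable depends on the analysis.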
    In [10]:
    # Preparation for the folium map built later
    
    locations = df[['Lat', 'Lon']]
    locationlist = locations.values.tolist()
    
    In [11]:
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 637 entries, 0 to 636
    Data columns (total 8 columns):
     #   Column       Non-Null Count  Dtype              
    ---  ------       --------------  -----              
     0   Screen_name  637 non-null    object             
     1   UserID       637 non-null    int64              
     2   TweetID      637 non-null    int64              
     3   Coords       637 non-null    object             
     4   Lat          637 non-null    float64            
     5   Lon          637 non-null    float64            
     6   Created_At   637 non-null    datetime64[ns, UTC]
     7   Text         637 non-null    object             
    dtypes: datetime64[ns, UTC](1), float64(2), int64(2), object(3)
    memory usage: 39.9+ KB
    
    In [12]:
    df.describe()
    
    Out[12]:
    UserID TweetID Lat Lon
    count 6.370000e+02 6.370000e+02 637.000000 637.000000
    mean 1.796320e+16 7.809857e+17 43.719458 10.394245
    std 1.120336e+17 1.200428e+15 0.005916 0.007364
    min 9.900982e+06 7.790024e+17 43.695943 10.378996
    25% 1.020629e+08 7.798249e+17 43.716700 10.389450
    50% 3.123077e+08 7.810749e+17 43.722511 10.396417
    75% 8.177871e+08 7.819278e+17 43.723056 10.396777
    max 7.528317e+17 7.830572e+17 43.731070 10.425483

    I see that the cluster with the most data, and therefore the most users who tweeted in it, is cluster 1 with 306 tweets, while cluster -1 (the third cluster by number of users) already has only 62 users, about one fifth of the first cluster.

    In [13]:
    users = df['Screen_name']
    users.value_counts()
    
    Out[13]:
    Colomboalejo      22
    minie_0           17
    LaScalettaPisa    16
    nelisaMM          12
    LuisConcept       11
                      ..
    BuuenaOnda         1
    dafreaky79         1
    IbnuWidiant        1
    LuCappelletto      1
    stelios4k          1
    Name: Screen_name, Length: 336, dtype: int64
    In [14]:
    users = df['UserID']
    users.value_counts()
    
    Out[14]:
    1372289694    22
    102062879     17
    2327332688    16
    932237454     12
    235625484     11
                  ..
    200690211      1
    2397692442     1
    356648473      1
    805577240      1
    20674560       1
    Name: UserID, Length: 336, dtype: int64

    The user who tweeted the most is Colomboalejo (User ID = 1372289694) with 22 tweets.
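
Both counts above list 336 distinct values, which suggests that screen names and user IDs map one to one. A small sketch of how this could be checked, and how the top user could be read off programmatically, is given here. The frame `toy` is hypothetical data with the same column names as `df`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df; values are made up.
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 3, 3],
    "Screen_name": ["anna", "anna", "bob", "carla", "carla", "carla"],
})

# Number of distinct screen names per user ID, and the reverse.
names_per_id = toy.groupby("UserID")["Screen_name"].nunique()
ids_per_name = toy.groupby("Screen_name")["UserID"].nunique()
one_to_one = bool((names_per_id == 1).all() and (ids_per_name == 1).all())

# Most active user by ID, with the matching screen name.
top_id = toy["UserID"].value_counts().idxmax()
top_name = toy.loc[toy["UserID"] == top_id, "Screen_name"].iloc[0]
print(one_to_one, top_id, top_name)
```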

    In [15]:
    # Day of the month as a string, e.g. '22'; assigning the whole column
    # at once avoids chained indexing and the SettingWithCopyWarning
    df['Day'] = df['Created_At'].dt.day.astype(str)
    df
    
    Out[15]:
    Screen_name UserID TweetID Coords Lat Lon Created_At Text Day
    0 madikeeper12 868809325 779072240994234368 [43.72666207, 10.41268069] 43.726662 10.412681 2016-09-22 21:37:51+00:00 Cieli infuocati.\n\n#picoftheday #quotesofthed... 22
    1 madikeeper12 868809325 781615843406819329 [43.72666207, 10.41268069] 43.726662 10.412681 2016-09-29 22:05:13+00:00 Prospettive.. \nunite a casa #ilselfone\n#team... 29
    2 madikeeper12 868809325 781870800156499968 [43.72666207, 10.41268069] 43.726662 10.412681 2016-09-30 14:58:19+00:00 Non occorre essere matti per lavorare qui, ma ... 30
    3 madikeeper12 868809325 780003801260404736 [43.7167, 10.3833] 43.716700 10.383300 2016-09-25 11:19:32+00:00 RunOnSunDay 🏃🏽‍♀️☀️\n#run #running #runner #ni... 25
    4 madikeeper12 868809325 779443101123260417 [43.70561, 10.42059] 43.705610 10.420590 2016-09-23 22:11:31+00:00 La vita è come la fotografia sono necessari i ... 23
    ... ... ... ... ... ... ... ... ... ...
    632 antoniocassisa 358042635 781879291911016448 [43.7167, 10.3833] 43.716700 10.383300 2016-09-30 15:32:04+00:00 I mì ómini \n#son #figli #boys @ Pisa, Italy h... 30
    633 SefaMermer 293157588 780753755830677504 [43.7167, 10.3833] 43.716700 10.383300 2016-09-27 12:59:35+00:00 #love #tbt #tagforlikes #TFLers #tweegram #pho... 27
    634 SefaMermer 293157588 780756143668953088 [43.7167, 10.3833] 43.716700 10.383300 2016-09-27 13:09:05+00:00 #love #tbt #tagforlikes #TFLers #tweegram #pho... 27
    635 matteluca89 494389053 779638196258811904 [43.71544235, 10.40051616] 43.715442 10.400516 2016-09-24 11:06:45+00:00 Last saturday I went out with my #chinese teac... 24
    636 anabrmotta 98254561 781123690343698432 [43.72263, 10.3948] 43.722630 10.394800 2016-09-28 13:29:35+00:00 Já que é pra tombar, ela tombou (só um pouquin... 28

    637 rows × 9 columns

    In [16]:
    unique_users = df.groupby(['Day'])['Screen_name'].unique() #.count() #.sort_values(ascending=False)
    unique_users
    
    Out[16]:
    Day
    1     [ErsenTArt, sifumalacarne, ericrosbh, passaret...
    2     [CMackHoops, ParallelRae, harrievanhout33, Cri...
    22    [madikeeper12, MicheleGaddi94, sifumalacarne, ...
    23    [madikeeper12, DeliberateFare, Nunniet, xostep...
    24    [ashleybear52, ElyPardy, gabrielflorin2, Crist...
    25    [madikeeper12, Mr_R1cardo, plutoniummuffin, Cr...
    26    [Mr_R1cardo, Vygotskij, Alexandros76, christab...
    27    [plutoniummuffin, sifumalacarne, brauliocgil, ...
    28    [jhuelga, maca_gabi, gabrielflorin2, MrOranjeT...
    29    [madikeeper12, MarielenaP, bubudriri, ChelseaM...
    3     [Mattar79, minie_0, diggino, HelloMaycille, Mi...
    30    [madikeeper12, TweetsByRyland, Mhegazy79, Mich...
    Name: Screen_name, dtype: object
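
Because Day holds the day of the month as a string, the groups above sort lexicographically (1, 2, 22, ..., 3, 30) rather than chronologically. The sketch below groups on the full calendar date instead and also computes the average number of tweets per user per day from the list of possible analyses. The frame `toy` and the column name `Date` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df; values are made up.
toy = pd.DataFrame({
    "Screen_name": ["anna", "anna", "bob", "anna", "carla", "carla"],
    "Created_At": pd.to_datetime([
        "2016-09-30 10:00:00+00:00",
        "2016-09-30 12:00:00+00:00",
        "2016-09-30 13:00:00+00:00",
        "2016-10-01 09:00:00+00:00",
        "2016-10-01 09:30:00+00:00",
        "2016-10-02 18:00:00+00:00",
    ]),
})

# Full calendar date (UTC), so that groups sort chronologically.
toy["Date"] = toy["Created_At"].dt.date

per_day = toy.groupby("Date").agg(
    tweets=("Screen_name", "count"),          # tweets posted on that date
    unique_users=("Screen_name", "nunique"),  # distinct users on that date
)
per_day["tweets_per_user"] = per_day["tweets"] / per_day["unique_users"]
print(per_day)
```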
    In [17]:
    # group the DataFrame by day
    Days = df.groupby('Day')
    
    max_users = {}         # top user of a day -> number of tweets on that day
    max_users_perDay = {}  # calendar date -> top user of that day
    
    for group in Days.groups:
        day = Days.get_group(group)
        users_collection = collections.Counter(day['Screen_name'].tolist())
        # user with the highest tweet count for this day
        top_user = max(users_collection, key=users_collection.get)
        max_users[top_user] = users_collection[top_user]
        max_users_perDay[day['Created_At'].iloc[0].date()] = top_user
    
    In [18]:
    max_users_perDay
    
    Out[18]:
    {datetime.date(2016, 10, 1): 'crissemedo',
     datetime.date(2016, 10, 2): '_martyn_fisher',
     datetime.date(2016, 9, 22): 'Insgaet',
     datetime.date(2016, 9, 23): 'cocoy_39',
     datetime.date(2016, 9, 24): 'nelisaMM',
     datetime.date(2016, 9, 25): 'elle__erre',
     datetime.date(2016, 9, 26): 'christaboffa',
     datetime.date(2016, 9, 27): 'roastd_chestnut',
     datetime.date(2016, 9, 28): 'CBanks4U',
     datetime.date(2016, 9, 29): 'mcmack08',
     datetime.date(2016, 10, 3): 'minie_0',
     datetime.date(2016, 9, 30): 'Colomboalejo'}
    In [19]:
    max_users
    
    Out[19]:
    {'crissemedo': 4,
     '_martyn_fisher': 4,
     'Insgaet': 5,
     'cocoy_39': 4,
     'nelisaMM': 8,
     'elle__erre': 4,
     'christaboffa': 3,
     'roastd_chestnut': 3,
     'CBanks4U': 6,
     'mcmack08': 10,
     'minie_0': 17,
     'Colomboalejo': 22}
    In [20]:
    df_max_users = pd.DataFrame(data=max_users_perDay.items(), columns=['Day', 'User'])
    df_max_users['Number of tweets'] = list(max_users.values())
    df_max_users = df_max_users.sort_values(by=['Day']).reset_index(drop=True)
    df_max_users
    
    Out[20]:
    Day User Number of tweets
    0 2016-09-22 Insgaet 5
    1 2016-09-23 cocoy_39 4
    2 2016-09-24 nelisaMM 8
    3 2016-09-25 elle__erre 4
    4 2016-09-26 christaboffa 3
    5 2016-09-27 roastd_chestnut 3
    6 2016-09-28 CBanks4U 6
    7 2016-09-29 mcmack08 10
    8 2016-09-30 Colomboalejo 22
    9 2016-10-01 crissemedo 4
    10 2016-10-02 _martyn_fisher 4
    11 2016-10-03 minie_0 17
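
The table above is assembled from two dictionaries that are assumed to stay aligned. Since `max_users` is keyed by screen name, a user who tops two different days would overwrite their own earlier entry. A possible alternative that keeps day, user and count in the same row is sketched below, using hypothetical toy data and a hypothetical `Date` column.

```python
import pandas as pd

# Hypothetical toy data; values are made up.
toy = pd.DataFrame({
    "Screen_name": ["anna", "anna", "bob", "bob", "bob", "carla"],
    "Date": ["2016-09-30"] * 3 + ["2016-10-01"] * 3,
})

# One row per (date, user) pair with the number of tweets.
counts = (
    toy.groupby(["Date", "Screen_name"])
       .size()
       .reset_index(name="Number of tweets")
)

# Keep the row with the highest count within each date.
top_rows = counts.loc[counts.groupby("Date")["Number of tweets"].idxmax()]
top_per_day = (
    top_rows.rename(columns={"Screen_name": "User", "Date": "Day"})
            .reset_index(drop=True)
)
print(top_per_day)
```

When several users tie on a day, `idxmax` keeps the first matching row, which here is the alphabetically first screen name.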
    In [21]:
    import plotly.express as px
    
    x = df_max_users['Day'] 
    y = df_max_users['Number of tweets']
    
    In [23]:
    fig = px.bar(df_max_users, x=x, y=y,
                 color=y, hover_data=['User', 'Number of tweets'], height=400,
                title="Users with the highest number of tweets per day in 2016")
    fig.show()
    
    In [24]:
    import folium
    from folium import plugins
    from shapely.geometry import Point, Polygon, LineString
    
    # the coordinates are: [(43.7359, 10.4269), (43.6955, 10.3686)]
    lat = [43.7359, 43.6955]
    lon = [10.4269, 10.3686]
    
    lat_mean = np.mean(lat)
    lon_mean = np.mean(lon)
    
    coords = df['Coords']
    m = folium.Map(location=[lat_mean, lon_mean], tiles='Stamen Toner', zoom_start=13.2, control_scale=True)
       
    for point in range(0, len(locationlist)):
        folium.Marker(locationlist[point], popup=df['Screen_name'][point]).add_to(m)
     
    m
    
    In [25]:
    map2 = folium.Map(location=[lat_mean, lon_mean], tiles='CartoDB positron', zoom_start=13.2)
    
    marker_cluster = folium.plugins.MarkerCluster().add_to(map2)
    
    for point in range(0, len(locationlist)):
        folium.Marker(locationlist[point], popup=df['Screen_name'][point]).add_to(marker_cluster)
    map2
    
    In [26]:
    for group in Days.groups:
        globals()[f"day{group}"] = Days.get_group(group).reset_index(drop=True)
    

    This gives us the following variables:

  • day22
  • day23
  • day24
  • day25
  • day26
  • day27
  • day28
  • day29
  • day30
  • day1
  • day2
  • day3
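
Creating variables through `globals()` works, but the names only exist at run time and are easy to mistype. A dictionary keyed by the same names is one possible alternative, sketched here on hypothetical toy data; the name `days` is an assumption.

```python
import pandas as pd

# Hypothetical toy data with a string Day column, as in df; values are made up.
toy = pd.DataFrame({
    "Screen_name": ["anna", "bob", "carla", "anna"],
    "Day": ["22", "22", "23", "3"],
})

# One DataFrame per day, each with a fresh 0-based index.
days = {
    f"day{key}": group.reset_index(drop=True)
    for key, group in toy.groupby("Day")
}
print(sorted(days))
```

A single day is then accessed as `days["day22"]`, and all days can be iterated with `days.items()`.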
    In [27]:
    # Test with day22
    
    locations_day22 = day22[['Lat', 'Lon']]
    locationlist_day22 = locations_day22.values.tolist()
    
    In [28]:
    map_day22 = folium.Map(location=[lat_mean, lon_mean], tiles='CartoDB positron', zoom_start=15)
    
    marker_cluster_day22 = folium.plugins.MarkerCluster().add_to(map_day22)
    
    for point in range(0, len(locationlist_day22)):
        folium.Marker(locationlist_day22[point], popup=df['Screen_name'][point]).add_to(marker_cluster_day22)
    map_day22
    